Abstract- This paper presents feature extraction and estimation of
multifractal measures of DNA sequences, using a multifractal
methodology, and demonstrates a new scheme for identifying biological
functionality, using information contained within the DNA sequences.
It shows that the Renyi and Mandelbrot fractal dimension spectra may
be useful techniques for extracting the information contained in the
DNA sequences.
I. Introduction

The genome of an organism is the total set of its deoxyribonucleic
acid (DNA) molecules, which are composed of a sugar-phosphate backbone
and four nitrogenous bases (adenine, A; thymine, T; cytosine, C; and
guanine, G), repeated millions of times throughout the genome. The
genome contains the master blueprint for all cellular structures and
activities over the lifetime of the cell or the organism. This
blueprint is mostly stored in the genes, each responsible for making a
single protein. In higher eukaryotic organisms, a gene is a piece of
DNA sequence composed of exons and introns. It has long been known
that the coding regions (the exons) of the genes carry the information
that directs the cellular process leading from DNA sequences to amino
acid sequences, or proteins, while the non-coding regions (the introns
of the genes and the intergenic regions) contain no information about
the proteins in the organism. The proteins in the organism determine,
among other things, how the organism looks, how well its body
metabolizes food or fights infection, and sometimes even how it
behaves. Therefore, accurate localization of genes and other parts of
our genome may lead to an understanding of the genome and, in turn, of
life.

Recently, a draft sequence covering 96% of the entire human genome,
which contains 3 x 10^9 base pairs, has been published by the Human
Genome Project (HGP) and Celera Genomics. However, the rate of
locating genes is relatively low, and about 35% of human genes remain
unknown. Therefore, developing computational tools for locating genes
and elucidating their structure is becoming essential for molecular
biology. In addition, new computational methods can provide
complementary information that benefits the location of genes by
traditional experimental methods.

Most current research on deciphering the meaning of DNA sequences is
approached from the low base-pair level. Its main objective is to
search for patterns or correlations in the DNA sequence related to
codons (three-base sequences), amino acids, and proteins. A number of
gene prediction systems have been developed in recent years. These
systems use a variety of sophisticated computational techniques,
including neural networks [1], dynamic programming [2], rule-based
methods [3], decision trees [4], probabilistic reasoning [5], and
hidden Markov chains [6]. Most of these techniques rely on the
statistical qualities of exons in the genome; their fundamental
limitation is therefore the use of a known gene data pool as a
training set for classification. Consequently, they are capable of
finding only the genes that are homologous with those in the training
data set.

Kinsner et al. have demonstrated that fractal techniques can be useful
in the classification of stationary and nonstationary signals such as
speech, image, and radio transmitter transients [7]. Assuming that DNA
sequences have a fractal structure, Karlin et al. [8] have
demonstrated long-range power-law relations in DNA sequences, spanning
10^4 nucleotides. Peng et al. [9] demonstrated the correlation
properties of coding and non-coding regions of DNA sequences, using
the Levy walk method to map the DNA alphabet sequence into a numerical
sequence. Other researchers have shown that long-range fractal
correlations appear in the coding regions of DNA sequences, with
different values in different regions of the sequence [10], [11],
[12]. Alternatively, Barral et al. [13] reported that coding regions
behave statistically as random chains, as compared to non-coding
regions. Yu et al. [14], [15] proposed a time series model based on
the global structure of the complete genome, and showed long-range
correlations in bacterial DNA sequences. Although those papers present
various algorithms, the Levy walk or a modified Levy walk is often
used for translating a DNA alphabet sequence into a numerical
sequence. In the Levy walk, a walker descends or rises one step at
position i along a DNA sequence chain if a pyrimidine (C/T) or a
purine (A/G) occurs, respectively. Therefore, an artificial error is
introduced by giving specific values to the sequence at each base
pair.
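
The walk described above can be illustrated with a minimal Python
sketch. It assumes the plain rule stated in the text (one step up for
a purine, one step down for a pyrimidine); the function name dna_walk
is ours and does not come from the cited papers, and the modified
walks used in the literature may differ in detail.

```python
from itertools import accumulate

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_walk(sequence):
    """Return the walker's position after each base.

    Illustrative rule taken from the text: +1 for a purine (A/G),
    -1 for a pyrimidine (C/T). The name and interface are hypothetical.
    """
    steps = []
    for base in sequence.upper():
        if base in PURINES:
            steps.append(1)
        elif base in PYRIMIDINES:
            steps.append(-1)
        else:
            raise ValueError(f"unexpected base: {base!r}")
    # Running sum of the steps gives the walk itself.
    return list(accumulate(steps))
```

For example, dna_walk("AGCT") rises twice and then descends twice,
returning to zero. The fixed +1/-1 values are exactly the arbitrary
numerical assignment that the paragraph above criticizes.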

Unequal usage of codons in the coding regions appears to be a
universal feature of genomes across a wide range of species. The bias
is mainly the result of two forces: the amino acid usage bias and the
unequal usage of synonymous codons [16]. The latter could be
correlated with the mutational biases and natural selection acting at
the levels of replication, transcription, and translation [17], [18].
In other words, each organism has its own synonymous codon
preferences.

In this paper, we introduce a new algorithm that uses multifractal
techniques and the uneven codon usage between coding and non-coding
regions in DNA sequences. Based only on the structural information
given by the DNA sequences, our algorithm demonstrates a significant
difference between the coding and non-coding regions without
pre-training data sets. We thereby demonstrate a new way of locating
genes in a genome. We have developed a technique that: (i) transforms
a DNA alphabet sequence to a DNA numerical signal sequence using a
specially constructed transform matrix; and (ii) estimates the Renyi
and Mandelbrot dimension spectra of the DNA numerical signal sequence.

II. Background

2.1 Transcription, Translation, and Open Reading Frame

As shown in Fig. 1, a genome is composed of genes and intergenic
regions. The structure of the genes in higher eukaryotic organisms
usually consists of a number of small exons (coding portions)
separated by larger introns (non-coding portions). During
transcription, a messenger ribonucleic acid (mRNA) is produced based
on the corresponding individual base pairs of the coding portions of a
gene (with T substituted by uracil, U), and the non-coding portions of
the gene are removed. During translation, each specific codon from the
mRNA template is responsible for the selection of a corresponding
amino acid. In addition, the sequence of codons, terminated by stop
codons and the poly-A tail in the mRNA template, is responsible for
the orderly assembly into an amino acid sequence. The resulting amino
acid sequence is a protein. There are 64 possible codons corresponding
to only 20 amino acids and a stop signal. An open reading frame (ORF)
is the decoding sequence which is translated from an mRNA template. An
ORF contains not only amino acids but also stop signals. There are
three frames for an mRNA template, since a three-base DNA sequence
represents one amino acid. Only one frame, which is the ORF,
represents the amino acid sequence. The other frames are shifted by
either one or two bases of the DNA sequence with respect to the ORF.
To locate a gene's position is to determine the positions of its
exons.
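
The three reading frames can be made concrete with a short Python
sketch. The function names are hypothetical and are not part of our
algorithm; the stop codons TAA, TAG, and TGA are those of the standard
genetic code, written here in DNA letters.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def codons_in_frame(sequence, frame):
    """Split a DNA string into complete codons, starting at offset
    frame (0, 1 or 2). A trailing partial codon is dropped."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    seq = sequence.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def stop_positions(sequence, frame):
    """Indices (in codon units) of the stop codons in the given frame."""
    codons = codons_in_frame(sequence, frame)
    return [k for k, codon in enumerate(codons) if codon in STOP_CODONS]
```

With the toy sequence "ATGAAATAG", frame 0 reads ATG, AAA, TAG and
ends on a stop codon, whereas frame 1 reads TGA, AAT and hits a stop
codon immediately. Only one of the three frames carries the intended
codon sequence.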

2.2  DNA  Signal 

For the DNA multifractal analysis, we first translate a DNA sequence
into a corresponding DNA numerical signal sequence, using a specific
character-to-number translation matrix. The matrix is constructed on
the general assumption that all the ORFs of the coding sequences
within a genome share a common codon usage bias, while the non-coding
regions and all the shifted frames of the coding sequences in the same
genome do not have a codon usage bias. We then treat the DNA numerical
signal sequence as a spatial series.
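
The translation step can be sketched in Python as follows. The entries
of the translation matrix are not listed in this section, so the
sketch takes a caller-supplied table, codon_weights, as a stand-in.
That table, the function name, and the default value for unlisted
codons are all assumptions made for illustration only.

```python
def dna_to_signal(sequence, codon_weights, frame=0, default=0.0):
    """Map each codon of one reading frame to a number.

    codon_weights is a hypothetical dict from codon to numeric value;
    it stands in for the character-to-number translation matrix, whose
    entries are not given here. Codons missing from the table map to
    default.
    """
    seq = sequence.upper()
    return [
        codon_weights.get(seq[i:i + 3], default)
        for i in range(frame, len(seq) - 2, 3)
    ]


# Made-up toy weights, not real codon usage values.
toy_weights = {"ATG": 1.0, "AAA": 0.5}
signal = dna_to_signal("ATGAAATAG", toy_weights)
```

Calling the function with frame=1 or frame=2 yields the signals of the
shifted frames, which is how the three frames of one sequence can be
compared.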

2.3 Renyi and Mandelbrot Fractal Dimension Spectra

Fractals have been studied extensively in physics and mathematics. A
fractal dimension expresses the degree of complexity (or roughness,
brokenness, and irregularity) of an object which is statistically
self-similar to some extent [19]. There are many distinct definitions
of fractal dimensions in order to reflect the different properties of
self-similar and self-affine signals. Morphological fractal dimensions
reveal the dominant fractal properties of a multifractal signal. Using
multifractal analysis, such as the Renyi and Mandelbrot fractal
dimension spectra, more information in the multifractal signal
structure can be revealed. Therefore, this approach can determine if
the signal is a single fractal (with a single-valued fractal
dimension), or a multifractal (with a spectrum of fractal dimensions).

Fig. 1. From a genome to proteins. (a) The structure of the higher
eukaryotic genomic DNA and (b) a schematic chart of the transcription
and translation process. (TSS, transcription start site; TIS,
translation initiation site; SCS, stop coding site.)

The Renyi dimension D_q is defined as [19]

$$D_q = \lim_{r \to 0} \frac{1}{q-1} \, \frac{\log \sum_{j=1}^{N_r} p_j^q}{\log r} \tag{1}$$

where r stands for the size of a volume element (vel), N_r is the
number of vels of a given size that cover the fractal object, p_j
denotes the frequency of occurrence in the j-th vel, and q is the
moment order.
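
As a numerical illustration, the Python sketch below evaluates the
standard finite-scale form of the Renyi dimension, log(sum of p_j^q)
divided by (q - 1) log r, for a given set of vel probabilities. It is
a minimal sketch rather than our implementation: the function names
are hypothetical, the limit r -> 0 is replaced by a single small vel
size, and q = 1 is handled by its usual limiting form.

```python
import math


def renyi_dimension(probabilities, r, q):
    """Finite-scale estimate of D_q from vel probabilities p_j at vel
    size r. Empty vels (p_j = 0) are ignored."""
    p = [x for x in probabilities if x > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: sum(p_j log p_j) / log r.
        return sum(x * math.log(x) for x in p) / math.log(r)
    return math.log(sum(x ** q for x in p)) / ((q - 1.0) * math.log(r))


def cantor_measure(level):
    """Uniform measure on the level-k middle-third Cantor set:
    2**k occupied vels of size 3**-k, each with probability 2**-k."""
    n = 2 ** level
    return [1.0 / n] * n, 3.0 ** (-level)


probs, size = cantor_measure(8)
spectrum = [renyi_dimension(probs, size, q) for q in (-5, 0, 1, 2, 5)]
```

For this uniform Cantor measure every entry of spectrum equals
log 2/log 3, about 0.6309, independent of q. A flat spectrum of this
kind is the signature of a single fractal, while a spectrum that
varies with q indicates a multifractal.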

The Mandelbrot dimension D_Man and the Holder exponent α_q are related
to the Renyi dimension by [19]

$$\alpha_q = \frac{d}{dq}\left[(q-1)\,D_q\right] \tag{2}$$

$$D_{Man} = q\,\alpha_q - (q-1)\,D_q \tag{3}$$
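
The relation between the two spectra can be sketched numerically in
Python. The sketch assumes the standard Legendre-transform form, in
which α_q is the derivative of tau(q) = (q - 1) D_q with respect to q
and D_Man = q α_q - tau(q). The derivative is approximated by a
central difference, and the binomial cascade used as test input is a
textbook multifractal chosen for illustration, not one of our DNA
signals. All function names are hypothetical.

```python
import math


def binomial_tau(q, p=0.3):
    """tau(q) = (q - 1) * D_q for a binomial cascade that splits each
    interval in half with weights p and 1 - p (illustrative input)."""
    return -math.log2(p ** q + (1.0 - p) ** q)


def mandelbrot_spectrum(tau, q_values, dq=1e-4):
    """Return (alpha_q, D_Man) pairs for the given moment orders.

    alpha_q is a central-difference estimate of d tau / dq, and
    D_Man = q * alpha_q - tau(q).
    """
    pairs = []
    for q in q_values:
        alpha = (tau(q + dq) - tau(q - dq)) / (2.0 * dq)
        pairs.append((alpha, q * alpha - tau(q)))
    return pairs


def monofractal_tau(q):
    """tau(q) for a single fractal with D_q = 0.63 at every q."""
    return (q - 1) * 0.63


mono = mandelbrot_spectrum(monofractal_tau, [-2, 0, 3])
multi = mandelbrot_spectrum(binomial_tau, [-2, 0, 3])
```

For the single fractal, every pair in mono is (0.63, 0.63) up to
rounding, so the spectrum reduces to one point. For the binomial
cascade the pairs in multi differ with q, tracing out a curve whose
value at q = 0 is 1.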

The Mandelbrot spectrum reveals the distribution of singularities in
the fractal object, and is a useful tool for an “on-line” analysis.
Since the Mandelbrot fractal dimension spectrum is derived from the
Renyi fractal dimension spectrum, there should be similarities between
them. However, there are also differences that may be useful in
classification.


III.  Results  and  Discussion 

3.1  DNA  Samples 

The human codon usage data are originally from Ikemura et al. [20].
For testing, we construct (i) a random DNA sequence generated from
uniform white noise and (ii) a Cantor DNA sequence which has a Cantor
set property. A piece of the human genomic DNA (ph-20) and the Hyal1
and Hyal2 cDNA sequences have been obtained from GenBank. The genomic
DNA sequence contains the human Hyal2 gene and exon 1 of Hyal1. For
the cDNA sequences, we remove the 5' and 3' non-coding regions before
testing.
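
A random test sequence of this kind can be sketched in Python as
follows. The sketch assumes that "uniform white noise" means each base
is drawn independently with equal probability; the function name and
the seed parameter are ours, added so that a test sequence can be
reproduced.

```python
import random


def random_dna(length, seed=None):
    """Draw each base independently and uniformly from A, C, G, T.

    Hypothetical helper: the equal-probability rule is our reading of
    "uniform white noise", and seed only makes the draw repeatable.
    """
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


sample = random_dna(300, seed=1)
```

Because the bases are independent and identically distributed, the
three reading frames of such a sequence share the same statistics.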

3.2  Renyi  Fractal  Dimension  Spectrum 

We have calculated the Renyi dimension spectra of the Cantor DNA
sequence, the random DNA sequence, and the coding regions (cDNAs and
exons) and non-coding regions (introns) of the real DNA sequences. The
results are shown in Fig. 2. Figure 2(a) shows that one of the three
frames of the Cantor DNA sequence has a single fractal dimension of
0.6288 (the solid line), which is very close to the theoretical value
(log 2/log 3, about 0.6309) of the Cantor set fractal dimension. The
other two frames are strictly not the Cantor set since they are
shifted by one or two bases and, hence, they show a slight
multifractality. The three frames of the random DNA sequence
(Fig. 2(b)) have exactly the same Renyi dimension spectrum since they
have the same statistical properties. For the real DNA sequences
(Figs. 2(c) to 2(f)), our results support Kinsner and Rifaat's
conclusion that DNA sequences are multifractal [21]. Only the ORFs
differ significantly in Renyi dimension from the shifted frames, owing
to the uneven codon usage on the ORF. As shown in Figs. 2(c) and 2(d),
the non-ORFs have shapes statistically similar to the white noise DNA
sequence. For the genomic DNA and the intron (Figs. 2(e) and 2(f)),
the codon usage is in general even on the three frames, although the
genomic DNA contains some coding regions. Therefore, the three frames
of the genomic sequence have the same Renyi dimension spectrum.

We also tested different flanking regions of the DNA sequences. There
is a slight, but not significant, difference among the Renyi dimension
spectra of the three frames (data not shown), which reflects the
GC-rich content of some areas of the flanking regions where regulatory
elements appear. Our algorithm can even show a difference in the Renyi
dimension spectra between the non-coding regions (including introns
and flanking regions) and the non-coding genes (data not shown),
although the differences are not significant.

3.3  The  Mandelbrot  Fractal  Dimension  Spectrum 

Figures 3(a) to 3(f) show the Cantor DNA sequence, the white noise
random DNA sequence, the coding region of the Hyal1 cDNA, exon 3 of
Hyal2, the human genomic DNA sequence (ph-20), and intron 1 of Hyal2,
respectively. The solid lines in Figs. 3(b) to 3(f) and the point in
Fig. 3(a) represent the ORFs of the coding regions and frame 1 of the
non-coding regions. The dashed and dash-dot lines represent the
shifted frames in both the coding and non-coding regions. Since one of
the frames of the Cantor DNA sequence demonstrates a single fractal
property, its Mandelbrot dimension and Holder exponent are constant
and, therefore, the Mandelbrot spectrum degenerates to a single point.
As with the Renyi dimension spectrum, the ORFs have Mandelbrot spectra
different from those of the shifted frames. The results also show that
the non-ORFs have a Mandelbrot spectrum similar to that of the white
noise, indicating that the non-ORFs and the white noise have the same
multifractal behaviour. Therefore, our results demonstrate a scheme
for locating coding regions within genomic DNA sequences.


IV.  Conclusions 


In the DNA sequences, there is a significant difference in the Renyi
dimensions between the ORF and the shifted frames of the coding
region. The three frames of the non-coding region have similar
multifractality, with a shape similar to that of uniform white noise.
Unlike current gene prediction algorithms, our multifractal algorithm
is based exclusively on the multifractal structure and entropy
properties of the DNA sequences, and does not need pre-training data
sets. Therefore, it opens up a useful way of classifying coding and
non-coding regions of DNA sequences.

Fig. 2. Renyi dimension spectrum of the DNA sequences: (a) a Cantor
DNA sequence, (b) a white noise random DNA sequence, (c) the coding
region of the Hyal1 cDNA, (d) exon 3 of Hyal2, (e) the human genomic
DNA sequence (ph-20), and (f) intron 1 of Hyal2. The solid lines
represent the ORF (frame 1) and the dash and dot-dash lines are
shifted frames.

Fig. 3. The Mandelbrot dimension of the DNA sequences: (a) a Cantor
DNA sequence, (b) a white noise random DNA sequence, (c) the coding
region of the Hyal1 cDNA, (d) exon 3 of Hyal2, (e) the human genomic
DNA sequence (ph-20), and (f) intron 1 of Hyal2. The solid lines
represent the ORF (frame 1) and the dash and dot-dash lines are
shifted frames. In Fig. 3(a), the solid line of Fig. 2(a) has
collapsed to a dot.